
add MultiPandasIndex helper class #7182

Draft · wants to merge 2 commits into base: main

Conversation

@benbovy (Member) commented Oct 18, 2022

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

This PR adds a xarray.indexes.MultiPandasIndex helper class for building custom meta-indexes that encapsulate multiple PandasIndex instances. Unlike PandasMultiIndex, meta-index classes inheriting from this helper may encapsulate loosely coupled (pandas) indexes whose coordinates have differing dimensions (each coordinate must still be 1-dimensional, but an Xarray index may be created from coordinates spanning different dimensions).

Early prototype in this notebook
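To make the idea concrete, here is a minimal, self-contained sketch of the "meta-index" concept using plain pandas indexes. The class name and `sel` signature are illustrative only, not the actual MultiPandasIndex API from this PR:

```python
import pandas as pd

# Hypothetical sketch: wrap several loosely coupled pandas indexes, each
# owning its own 1-dimensional coordinate, and dispatch label lookups to
# whichever encapsulated index owns the queried coordinate.
class MetaIndexSketch:
    __slots__ = ("indexes",)

    def __init__(self, indexes):
        # mapping of coordinate name -> pd.Index
        self.indexes = dict(indexes)

    def sel(self, labels):
        # delegate each label to the pandas index for that coordinate
        return {
            name: self.indexes[name].get_loc(label)
            for name, label in labels.items()
        }


meta = MetaIndexSketch({
    "x": pd.Index([10, 20, 30]),
    "y": pd.Index(["a", "b"]),
})
result = meta.sel({"x": 20, "y": "b"})  # -> {"x": 1, "y": 1}
```

The real implementation must additionally handle alignment, renaming, and index propagation, which is exactly what the helper class is meant to factor out.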

TODO / TO FIX:

  • How to allow custom __init__ options in subclasses to be passed to all the type(self)(new_indexes) calls inside the MultiPandasIndex "base" class? This could be done via **kwargs passed through; however, mypy will certainly complain (Liskov Substitution Principle).
  • Is MultiPandasIndex a good name for this helper class?
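One possible answer to the **kwargs question above, sketched with illustrative names (not the PR's actual API): record the subclass-specific options at __init__ time and replay them whenever type(self)(...) rebuilds the index.

```python
# Hypothetical pattern: the base class stores extra options so that
# reconstruction via type(self)(...) preserves both the subclass type
# and its custom settings.
class MultiLikeBase:
    def __init__(self, indexes, **options):
        self.indexes = indexes
        self._options = options  # preserved for reconstruction

    def _replace(self, new_indexes):
        # type(self) keeps the subclass; _options keeps its settings
        return type(self)(new_indexes, **self._options)


class TolerantIndex(MultiLikeBase):
    def __init__(self, indexes, tolerance=0.0):
        super().__init__(indexes, tolerance=tolerance)
        self.tolerance = tolerance


idx = TolerantIndex({"x": None}, tolerance=0.5)
new = idx._replace({"y": None})
# new is still a TolerantIndex with tolerance=0.5
```

mypy will still flag the widened subclass __init__ signature (the LSP issue noted above), so this only addresses the runtime side of the question.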


return type(self)(new_indexes)

def copy(self: T_MultiPandasIndex, deep: bool = True) -> T_MultiPandasIndex:
@benbovy (Member Author):

Needs to be updated following #7140.

@keewis (Collaborator) commented Oct 18, 2022

I wonder if it is possible to create a generic MultiIndex? Something like

ds.set_xindex(
    ["a", "b"],
    MultiIndex([("a", PandasIndex), ("b", PandasIndex), (["a", "b"], BallTreeIndex)]),
)

(but I'm sure we can find a better syntax... maybe create a hashable sequence that is not a tuple, since that is already taken? and change the list of tuples to a dict).

For that concept to work, the return value of MultiIndex would have to be a factory function / class, which would instantiate the actual index.
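The factory idea might look roughly like this (all names and the build step are hypothetical):

```python
# Hypothetical sketch: MultiIndex(spec) does not build an index itself;
# it returns a class-like object that xarray would instantiate later to
# construct the actual per-coordinate sub-indexes.
def MultiIndex(spec):
    class _MetaIndexFactory:
        def __init__(self, variables):
            # placeholder build step: one entry per (coords, index_cls) pair
            self.indexes = [(names, cls) for names, cls in spec]

    return _MetaIndexFactory


factory = MultiIndex([("a", "PandasIndex"), ("b", "PandasIndex")])
index = factory(variables={})
```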

What do you think?


seen = set()
dup_dims = []
for d in dims:
@headtr1ck (Collaborator):

Since dims is a dict, shouldn't the keys be unique? I don't think you will have repetitions here.

@benbovy (Member Author):

Ah yes, sure 🙂. But we should probably check for duplicate dimensions so a dict is not what we want for dims.

@headtr1ck (Collaborator) commented Oct 18, 2022:

You could just do

dim_names = [idx.dim for idx in indexes]
if len(dim_names) != len(set(dim_names)):
    raise ValueError(...)
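For comparison, a runnable sketch of the loop-based variant that also reports which dimensions are duplicated (function name is illustrative):

```python
# Collect dimension names that appear more than once, preserving order.
def find_duplicate_dims(dims):
    seen = set()
    dups = []
    for d in dims:
        if d in seen:
            dups.append(d)
        seen.add(d)
    return dups


find_duplicate_dims(["x", "y", "x"])  # -> ["x"]
```

The set-length comparison is shorter when only a yes/no answer is needed; the loop variant pays off when the error message should name the offending dimensions.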


__slots__ = ("indexes", "dims")

def __init__(self, indexes: Mapping[Hashable, PandasIndex]):
@headtr1ck (Collaborator):

Always better to use Any for the key type of Mapping inputs: the key type is invariant, so e.g. a dict[str, ...] would be rejected where Mapping[Hashable, ...] is expected.
Use Mapping[Any, PandasIndex]
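For context, a small sketch (function name illustrative) of why Mapping[Any, ...] is the permissive input annotation:

```python
from typing import Any, Mapping

# Mapping is invariant in its key type, so mypy rejects passing a
# dict[str, int] where Mapping[Hashable, int] is expected; annotating
# the parameter as Mapping[Any, int] accepts it. The difference only
# shows up in the type checker, not at runtime.
def total(values: Mapping[Any, int]) -> int:
    return sum(values.values())


total({"x": 1, "y": 2})  # -> 3
```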

def _get_unmatched_names(
self: T_MultiPandasIndex, other: T_MultiPandasIndex
) -> set:
return set(self.indexes).symmetric_difference(other.indexes)
@headtr1ck (Collaborator):

I always find the operator notation easier to read: set(self.indexes) ^ set(other.indexes)
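The two spellings are equivalent:

```python
# Symmetric difference: elements in exactly one of the two sets.
a = {"x", "y"}
b = {"y", "z"}
assert (a ^ b) == a.symmetric_difference(b) == {"x", "z"}
```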


def _get_unmatched_names(
self: T_MultiPandasIndex, other: T_MultiPandasIndex
) -> set:
@headtr1ck (Collaborator):

Use set[Hashable]


return type(self)(new_indexes)

def copy(self: T_MultiPandasIndex, deep: bool = True) -> T_MultiPandasIndex:
@headtr1ck (Collaborator):

Might as well implement the correct deep-copy behavior which passes the memo dict.
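A sketch of what memo-aware deep copying could look like (illustrative class, not the PR's implementation):

```python
import copy

# Forwarding the memo dict to nested deepcopy calls ensures that shared
# sub-objects are copied only once and that reference cycles terminate.
class MultiLike:
    def __init__(self, indexes):
        self.indexes = indexes

    def copy(self, deep=True):
        return copy.deepcopy(self) if deep else copy.copy(self)

    def __deepcopy__(self, memo):
        return type(self)(copy.deepcopy(self.indexes, memo))


orig = MultiLike({"x": [1, 2]})
dup = orig.copy(deep=True)
dup.indexes["x"].append(3)
# orig.indexes["x"] is unchanged: [1, 2]
```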

else:
return type(self)(new_indexes)

def sel(self, labels: dict[Any, Any], **kwargs) -> IndexSelResult:
@headtr1ck (Collaborator):

Use Mapping[Any, Any]

@benbovy (Member Author):

This method should be called internally by Xarray (always with a dict), but there may be cases where one wants to call it directly, so Mapping[Any, Any] would be better.

@benbovy (Member Author) commented Oct 18, 2022

I wonder if it is possible to create a generic MultiIndex?

Hmm, that could be possible, but I think there are just too many possible edge cases for something generic like that.

In your specific example

ds.set_xindex(
    ["a", "b"],
    MultiIndex([("a", PandasIndex), ("b", PandasIndex), (["a", "b"], BallTreeIndex)]),
)

we could probably use the BallTreeIndex for point-wise indexing (i.e., with ds.sel(a=xr.DataArray(...), b=xr.DataArray(...))) and use the two PandasIndex instances for other kinds of selection (e.g., with slices, scalars, etc.) so there's no conflict, but I doubt this would be what we want in other cases.

I guess your suggestion is a way around the constraint in the Xarray data model that a coordinate cannot have multiple indexes? I'm afraid there's no easy solution that is generic enough. Maybe some cache to avoid rebuilding the indexes? I.e., .set_xindex() doesn't drop the pre-existing index(es) but rather disables them, so that they can be re-enabled later with another .set_xindex() call (.xindexes only returns the "active" indexes, but there may be other "inactive" indexes attached to a dataset).
